Using k-Means? Consider ArrayMiner
Abstract
With the (near-)availability of complete sequences of human and other genomes, research effort is turning from obtaining the sequence itself towards determining the biological function of the genes within the sequence. One approach, made practically feasible by the advent of DNA microarray technology, consists of clustering genes into groups of coexpressed genes, i.e. genes exhibiting similar behavior in some circumstances. Among the various methods used to identify groups of coexpressed genes, k-Means is one of the most popular. However, we show here that k-Means is a highly unreliable method of clustering, yielding high-quality solutions only with low probability. We then present the ArrayMiner software, based on Genetic Algorithms, which addresses this drawback, supplying high-quality solutions in a short time with very high reliability.

1. Clustering of Expression Profiles

With the (near-)availability of complete sequences of human and other genomes such as Drosophila and Arabidopsis, genomics has produced a wealth of sequence data, and the stage has been set for the next task, namely identifying the biological function of the genes within those sequences. Indeed, only this latter knowledge will enable researchers to establish correspondences between diseases and the genome, paving the way to new medications.

A major difficulty lies in the fact that the detailed phenomena taking place within organisms are not fully understood, and many probably remain to be discovered. Identifying the phenomena and the genes involved, together with the ways in which the genes interact, constitutes the major challenge of “post-sequencing” genetics.

One approach to achieving this consists of identifying groups of coexpressed genes, i.e. groups of genes that behave similarly in some conditions.
The rationale behind this approach is that similarly behaving genes probably participate together in some phenomenon or have similar functions. Identifying such groups may thus help in identifying the phenomenon, discovering previously unknown phenomena, or assigning previously unknown functions to genes. As an additional benefit, clustering gene expression data into groups reduces an unmanageable volume of data to data sets that can be handled more easily by biologists.

In this approach, each gene (or ORF) is represented by an expression profile, i.e. a series of numerical values of its activity. The profile of a gene may correspond to a set of readings of the gene’s activity under various conditions, or be a time series of the gene’s activity after some event.

Since the aim of clustering the expression profiles is to decide which genes exhibit similar behaviors (i.e., are coexpressed), a measure of similarity between profiles must be defined. Several can be found in the literature, including the Euclidean distance, the correlation, and the Pearson coefficient. This last measure is widely used for time-series profiles, as it measures the similarity of the shapes of the two profiles rather than the absolute values of the measurements.

2. The k-Means Technique
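The two ideas introduced so far, shape-based similarity between profiles and the dependence of k-Means on its starting point, can be illustrated with a small sketch. It is not the experiment reported in this paper, and it is not ArrayMiner. It assumes textbook k-Means (random initial centers, squared Euclidean distance) applied to synthetic profiles. The helper names (`pearson`, `sq_euclidean`, `make_profiles`, `kmeans`, `restart_scores`), the templates, the noise level and the number of restarts are all hypothetical choices made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def sq_euclidean(x, y):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def make_profiles(rng, n_per_group=30, noise=0.3):
    """Synthetic expression profiles: noisy copies of four made-up templates."""
    templates = [
        [0, 1, 2, 3, 4, 5],  # steadily increasing activity
        [5, 4, 3, 2, 1, 0],  # steadily decreasing activity
        [0, 2, 5, 5, 2, 0],  # transient peak
        [3, 3, 3, 3, 3, 3],  # flat
    ]
    return [[v + rng.gauss(0, noise) for v in t]
            for t in templates for _ in range(n_per_group)]


def kmeans(profiles, k, rng, max_iter=100):
    """Textbook k-Means; returns (labels, within-cluster sum of squares)."""
    # Random initialization: k distinct profiles serve as the first centers.
    centers = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile joins its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_euclidean(p, centers[c]))
                      for p in profiles]
        if new_labels == labels:
            break  # assignments stable: a local optimum has been reached
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(sq_euclidean(p, centers[lab]) for p, lab in zip(profiles, labels))
    return labels, wcss


def restart_scores(profiles, k, n_runs, seed=0):
    """Score reached by n_runs independent runs with different random starts."""
    rng = random.Random(seed)
    return [kmeans(profiles, k, rng)[1] for _ in range(n_runs)]


# Same shape, different magnitude: correlation is high, distance is large.
low, high = [1, 2, 3, 4], [10, 20, 30, 40]
print("Pearson:", pearson(low, high), "squared Euclidean:", sq_euclidean(low, high))

profiles = make_profiles(random.Random(42))
scores = restart_scores(profiles, k=4, n_runs=20)
best = min(scores)
near_best = sum(1 for s in scores if s <= best * 1.001)
print(f"best score {best:.2f}, worst score {max(scores):.2f}, "
      f"runs within 0.1% of best: {near_best}/{len(scores)}")
```

Comparing the best and worst scores across restarts gives a rough feel for how often a single run of this textbook variant ends in a poorer local optimum. The numbers depend entirely on the synthetic data and seeds chosen here and say nothing about real microarray data.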
Similar resources
A Hybrid Data Clustering Algorithm Using Modified Krill Herd Algorithm and K-MEANS
Data clustering is the process of partitioning a set of data objects into meaningful clusters or groups. Due to the vast usage of clustering algorithms in many fields, a lot of research is still going on to find the best and most efficient clustering algorithm. K-means is simple and easy to implement, but it is sensitive to the initialization of the cluster centers and hence can become trapped in a local optimum. In this paper...
Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis
Recognizing genes with distinctive expression levels can help in prevention, diagnosis and treatment of the diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering the gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...
Increasing the Accuracy of Recommender Systems Using the Combination of K-Means and Differential Evolution Algorithms
Recommender systems are the systems that try to make recommendations to each user based on performance, personal tastes, user behaviors, and the context that match their personal preferences and help them in the decision-making process. One of the most important subjects regarding these systems is to increase the system accuracy which means how much the recommendations are close to the user int...
A hybrid DEA-based K-means and invasive weed optimization for facility location problem
In this paper, instead of the classical approach to the multi-criteria location selection problem, a new approach was presented based on selecting a portfolio of locations. First, the indices affecting the selection of maintenance stations were collected. The K-means model was used for clustering the maintenance stations. The optimal number of clusters was calculated through the Silhou...
Designing an Algorithm for Cancerous Tissue Segmentation Using Adaptive K-means Cluttering and Discrete Wavelet Transform
Background: Breast cancer is currently one of the leading causes of death among women worldwide. The diagnosis and separation of cancerous tumors in mammographic images require accuracy, experience and time, and it has always posed itself as a major challenge to the radiologists and physicians. Objective: This paper proposes a new algorithm which draws on discrete wavelet transform and adaptive ...